Self-adjusting trees in practice for large text collections

نویسندگان

  • Hugh E. Williams
  • Justin Zobel
  • Steffen Heinz
چکیده

Splay and randomized search trees (RSTs) are self-balancing binary tree structures with little or no space overhead compared to a standard binary search tree (BST). Both trees are intended for use in applications where node accesses are skewed, for example in gathering the distinct words in a large text collection for index construction. We investigate the efficiency of these trees for such vocabulary accumulation. Surprisingly, unmodified splaying and RSTs are on average around 25% slower than using a standard binary tree. We investigate heuristics to limit splay tree reorganization costs and show their effectiveness in practice. In particular, a periodic rotation scheme improves the speed of splaying by 27%, while other proposed heuristics are less effective. We also report the performance of efficient bit-wise hashing and red– black trees for comparison. Copyright  2001 John Wiley & Sons, Ltd.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Self - Indexing Based on LZ 77 ? Sebastian

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that sou...

متن کامل

Self-Index Based on LZ77

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that sou...

متن کامل

MUCK: A toolkit for extracting and visualizing semantic dimensions of large text collections

Users with large text collections are often faced with one of two problems; either they wish to retrieve a semanticallyrelevant subset of data from the collection for further scrutiny (needle-in-a-haystack) or they wish to glean a high-level understanding of how a subset compares to the parent corpus in the context of aforementioned semantic dimensions (forestfor-the-trees). In this paper, I de...

متن کامل

Self-Indexing XML

Self-indexing is a technology that integrates text compression and text indexing, such that a text collection can be simultaneously compressed and indexed. The resulting representation, called a self-index of the text, takes space close to that of the compressed text, is able of reproducing any text substring, and oers indexed searching of the collection. This has been a major breakthrough in t...

متن کامل

An Evaluation of Self-adjusting Binary Search Tree Techniques

Much has been said in praise of self-adjusting data structures, particularly self-adjusting binary search trees. Self-adjusting trees are most suited to skewed key-access distributions as the techniques attempt to place the most commonly accessed keys near the root of the tree. Theoretical bounds on worst-case and amortized performance (i.e. performance over a sequence of operations) have been ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Softw., Pract. Exper.

دوره 31  شماره 

صفحات  -

تاریخ انتشار 2001